Global Startup Analysis

Advanced R Programming - Final Project

Chittilla Venkata Somanath

2026-02-08

Introduction & Research Questions

  • The “Unicorn” Landscape: This project analyzes private startups valued at $1 billion or more.

  • Research Question 1: Which industries achieve the highest valuations, and how does this vary by geography?

  • Research Question 2: Is there a correlation between the years taken to reach unicorn status and current valuation?

  • Dataset: unicorn_companies.csv, containing data on 1,074 companies.

Data Exploration

Initial inspection revealed 1,037 rows and 13 columns.

  • Data Types: All columns were initially read as characters (<chr>).

  • Cleaning Needs:

  • Valuations: Contained “$” symbols.

  • Founded Year: Contained “None” strings.

  • Dates: Required parsing into date-time objects.
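The checks behind these cleaning needs can be sketched roughly as follows. This is a minimal illustration on a toy tibble, not the project's actual inspection code: the three rows are invented, and only the column names (`Valuation ($B)`, `Date Joined`, `Founded Year`) follow the dataset as used in this report.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the raw file; every value here is invented for illustration
raw <- tibble(
  Company          = c("A", "B", "C"),
  `Valuation ($B)` = c("$140", "$100.3", "$95"),
  `Date Joined`    = c("4/7/2017", "12/1/2012", "1/23/2014"),
  `Founded Year`   = c("2012", "None", "2010")
)

glimpse(raw)  # shows the type of every column at a glance

# Are all columns character, as in the initial read?
all_character <- all(vapply(raw, is.character, logical(1)))

# Count the values that need cleaning in each problem column
issues <- raw |>
  summarise(
    dollar_signs = sum(str_detect(`Valuation ($B)`, fixed("$"))),
    none_strings = sum(`Founded Year` == "None")
  )
issues
```

Counting the problem values before cleaning gives a baseline to compare against afterwards, for example to confirm that the number of “None” strings matches the number of missing founding years after conversion.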

Data Cleaning & Feature Engineering

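A cleaning pipeline prepares the data for analysis: it renames the key columns, strips the “$” from valuations, parses the join dates, and converts “None” founding years to NA.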

library(tidyverse)  # Core manipulation
library(lubridate)  # Date handling
library(plotly)     # Interactive plots
library(patchwork)  # Combining plots
library(scales)     # Formatting axes
library(DT)         # Interactive tables
# Load the data
df <- read_csv("unicorn_companies.csv")
df_clean <- df |>
  rename(Valuation_B = `Valuation ($B)`, 
         Date_Joined = `Date Joined`, 
         Founded_Year = `Founded Year`) |>
  mutate(
    Valuation_B = as.numeric(str_remove(Valuation_B, "\\$")),
    Date_Joined = parse_date_time(Date_Joined, orders = c("mdy", "dmy", "ymd")),
    Founded_Year = as.numeric(na_if(as.character(Founded_Year), "None"))
  )

Advanced Imputation (.by)

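Missing founding years were filled with the median founding year of each company's industry, computed per group via mutate(.by = Industry). Years to unicorn is then derived as join year minus founding year, and rows with a negative value are dropped.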

df_clean <- df_clean |>
  mutate(Founded_Year = if_else(is.na(Founded_Year),
                                median(Founded_Year, na.rm = TRUE),
                                Founded_Year),
         .by = Industry) |>
  mutate(Join_Year = year(Date_Joined),
         Years_to_Unicorn = Join_Year - Founded_Year) |>
  filter(Years_to_Unicorn >= 0)
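To make the effect of the grouped imputation concrete, here is a small worked example on invented data. The tibble `toy` and its values are assumptions for illustration only; the `mutate()` call mirrors the one in the pipeline.

```r
library(dplyr)

# Invented data: one missing founding year per industry
toy <- tibble(
  Industry     = c("Fintech", "Fintech", "Fintech", "Health", "Health"),
  Founded_Year = c(2010, 2014, NA, 2000, NA)
)

imputed <- toy |>
  mutate(Founded_Year = if_else(is.na(Founded_Year),
                                median(Founded_Year, na.rm = TRUE),
                                Founded_Year),
         .by = Industry)

# The Fintech gap is filled with median(2010, 2014) = 2012,
# the Health gap with median(2000) = 2000
imputed$Founded_Year
```

Because the median is taken within each industry, a missing Fintech year is never filled from Health companies. An industry with no known founding years at all would keep its NA values under this approach.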

Industry & Valuation Insights

The analysis identified clear trends in sector performance and geographic dominance.

  • Top Hubs: USA, China, and India account for the majority of the global unicorn population.

  • Fintech Dominance: Fintech has the highest count of unicorns and a high average valuation.

  • AI Efficiency: The Artificial Intelligence sector shows the highest “Growth Rate” (valuation appreciation per year).
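Summaries of this kind could be computed along the following lines. This is a sketch under stated assumptions: the toy values are invented, and “Growth Rate” is taken here to mean valuation divided by years to unicorn, which is one plausible reading of “valuation appreciation per year” rather than a formula confirmed by the report.

```r
library(dplyr)

# Toy data in the shape of df_clean; all values are invented
toy <- tibble(
  Industry = c("Fintech", "Fintech", "Fintech",
               "Artificial Intelligence", "Artificial Intelligence"),
  Country = c("United States", "India", "United States",
              "China", "United States"),
  Valuation_B      = c(10, 4, 1, 12, 6),
  Years_to_Unicorn = c(5, 4, 2, 3, 2)
)

# Unicorn count per country, largest first
hub_counts <- toy |> count(Country, sort = TRUE)

# Count, average valuation, and assumed growth rate per industry
industry_summary <- toy |>
  filter(Years_to_Unicorn > 0) |>  # avoid dividing by zero
  mutate(Growth_Rate = Valuation_B / Years_to_Unicorn) |>
  summarise(
    n_unicorns      = n(),
    avg_valuation_b = mean(Valuation_B),
    avg_growth_rate = mean(Growth_Rate),
    .by = Industry
  ) |>
  arrange(desc(n_unicorns))

industry_summary
```

Companies that reached unicorn status in their founding year have zero years to unicorn, so they are filtered out before the division. Whether to drop them or floor the denominator at one year is a design choice that changes the ranking of fast-scaling sectors.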

Visualizations

    # --- 1. SETUP ---
    # Reuses the packages and df_clean from the cleaning steps above;
    # only the US/International flag needed for Plot D is added here.
    df_clean <- df_clean |>
      mutate(Is_US = if_else(Country == "United States", "US", "International"))
    
    # --- 2. MULTI-PANEL DASHBOARD (GGPLOT2 & PATCHWORK) ---
    
    # Plot A: Valuation vs. Speed to Unicorn
    p1 <- ggplot(df_clean, aes(x = Years_to_Unicorn, y = Valuation_B, color = Industry)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", color = "black", se = FALSE) +
      scale_y_log10(labels = label_dollar()) +
      labs(title = "Valuation vs. Scaling Speed", x = "Years to Unicorn", y = "Valuation ($B)") +
      theme_minimal() + theme(legend.position = "none")
    
    # Plot B: Top 5 Industries by Company Count
    p2 <- df_clean |> 
      count(Industry) |> 
      slice_max(n, n = 5) |> 
      ggplot(aes(x = reorder(Industry, n), y = n, fill = Industry)) +
      geom_col() + coord_flip() +
      labs(title = "Top 5 Industries by Unicorn Count", x = "", y = "Count") +
      theme_minimal() + theme(legend.position = "none")
    
    # Plot C: Global Unicorn Growth Over Time
    p3 <- df_clean |> 
      count(Join_Year) |> 
      ggplot(aes(x = Join_Year, y = n)) +
      geom_line(linewidth = 1, color = "steelblue") +
      geom_point() +
      labs(title = "Unicorn Emergence by Year", x = "Year Joined", y = "New Unicorns") +
      theme_minimal()
    
    # Plot D: Valuation Spread (US vs. International)
    p4 <- ggplot(df_clean, aes(x = Is_US, y = Valuation_B, fill = Is_US)) +
      geom_boxplot() +
      scale_y_log10(labels = label_dollar()) +
      labs(title = "Market Valuation: US vs. Intl", x = "", y = "Valuation ($B)") +
      theme_minimal() + theme(legend.position = "none")
    
    # Merge into unified Dashboard
    (p1 | p2) / (p3 | p4) + plot_annotation(title = "Global Unicorn Ecosystem Analysis")

    # --- 3. GEOGRAPHIC DISTRIBUTION (FACETED HISTOGRAM) ---
    
    p5 <- df_clean |> 
      filter(Country %in% c("United States", "China", "India")) |> 
      ggplot(aes(x = Valuation_B, fill = Country)) + 
      geom_histogram(bins = 20, color = "white", boundary = 0) + 
      facet_wrap(~Country, scales = "free_y") + 
      scale_x_log10(labels = label_dollar()) +
      labs(title = "Valuation Distribution: Top 3 Hubs", x = "Valuation ($B)", y = "Count") + 
      theme_minimal() + theme(legend.position = "none")
    
    print(p5)

    # --- 4. INTERACTIVE TABLE (DT) ---
    
    datatable(df_clean |> select(Company, Industry, Valuation_B, Country, Years_to_Unicorn),
              filter = 'top',
              options = list(pageLength = 10),
              caption = 'Filterable Global Unicorn Dataset')

Growth Efficiency Analysis

  • Hypothesis: “Blitz-scaling” (reaching $1B faster) leads to higher valuations.

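  • Finding: A slight negative correlation exists between years to unicorn and current valuation, so companies that scaled faster tend to carry somewhat higher valuations.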

    Quantitative Data:

  • Fast Track (<= 3 yrs): Higher average valuation.

  • Slow Track (> 10 yrs): Lower average valuation.
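A comparison like the one above could be reproduced roughly as follows. This is a sketch on invented numbers, not the report's results. The report does not state which correlation method was used, so the choice of Spearman rank correlation is an assumption, as is the “Middle” label for companies between the two tracks.

```r
library(dplyr)

# Invented data in the shape of df_clean
toy <- tibble(
  Valuation_B      = c(20, 12, 8, 5, 3, 2),
  Years_to_Unicorn = c(1, 2, 5, 7, 11, 14)
)

# Rank correlation is less sensitive to the heavy right tail of valuations
# (assumed method; the report does not specify one)
rho <- cor(toy$Years_to_Unicorn, toy$Valuation_B, method = "spearman")

# Average valuation per track, using the thresholds from the slide
track_summary <- toy |>
  mutate(Track = case_when(
    Years_to_Unicorn <= 3 ~ "Fast Track",
    Years_to_Unicorn > 10 ~ "Slow Track",
    .default = "Middle"
  )) |>
  summarise(n = n(), avg_valuation_b = mean(Valuation_B), .by = Track)

rho
track_summary
```

A negative `rho` means valuation tends to fall as years to unicorn rise. Reporting the group sizes next to the averages helps show whether a track's mean rests on only a handful of companies.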

Global Unicorn Dashboard

A multi-panel dashboard was constructed using patchwork to visualize the ecosystem.

  • Panel 1: Valuation vs. Years to Unicorn (Scatter).

  • Panel 2: Top 5 Industries by Count (Bar).

  • Panel 3: Unicorn Growth Over Time (Line).

  • Panel 4: Valuation Spread: US vs. International (Boxplot).

Conclusions & Limitations

  • Geographic Hubs: USA dominates Software/Services; China leads in Hardware and AI.

  • Speed Matters: Fast-scaling companies show higher average valuations, which is consistent with a “first-mover advantage”, although the underlying correlation is slight.

    Limitations:

    1. Survivor Bias: Dataset only includes companies that reached $1B.

    2. Static Valuations: “Paper values” may not reflect current liquid value in volatile markets.